1. Introduction

Diamonds have been considered a visually appealing fashion statement for more than 500 years. We have been hired by a diamond wholesaler to better understand the diamond market, identify trends, and build a model that can predict diamond prices for the next year.

2. Dataset Description:

To provide this in-depth analysis, we have been given a dataset containing sales information from 2010 to 2021.

The dataset contains 11 attributes:

  1. carat: the diamond's weight (1 carat = 200 mg)
  2. cut: rating system from 1 to 5 (poor to ideal)
  3. color: standardized color code; each diamond has a color
  4. clarity: standardized table; a measure of any defects that can impact visual appearance
  5. depth: percentage (0 to 100) relating the diamond's depth (top to bottom) to its width
  6. table: percentage (0 to 100) relating the diamond's overall width to the width of the top part
  7. price: what the diamond sold for
  8. x: length in millimeters
  9. y: width in millimeters
  10. z: height in millimeters
  11. year: year of the sale

3. Cleaning and Processing:

Upon exploring the data, we quickly realized that several cleaning steps were needed.

  1. We dropped several null values located in the carat column.

  2. We noticed several negative values in the cost(dollars) column. We decided to transform these values into positive values. Our reasoning is that they could be the result of human error during data entry.

  3. Outliers discovered in the length (mm), width (mm), height (mm) and cost(dollars) columns were also dropped.

  4. Upon research, we found that height (mm) contained some corrupted values that are not physically possible for a diamond; these were dropped.
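
The four cleaning steps above could be sketched with pandas roughly as follows. This is a minimal sketch under stated assumptions, not the code used for the analysis: the column names are taken from this section, the 1.5 × IQR fence is an assumed outlier rule, "physically impossible" is approximated as a non-positive height, and the small data frame is made-up illustration data.

```python
import pandas as pd


def clean_diamonds(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four cleaning steps to a copy of df (hypothetical helper)."""
    out = df.copy()

    # Step 1: drop rows with a missing carat value.
    out = out.dropna(subset=["carat"])

    # Step 2: treat negative costs as sign errors and make them positive.
    out["cost(dollars)"] = out["cost(dollars)"].abs()

    # Step 3: drop outliers column by column (assumed rule: 1.5 * IQR fences).
    for col in ["length (mm)", "width (mm)", "height (mm)", "cost(dollars)"]:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out = out[out[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

    # Step 4: drop physically impossible heights (assumed rule: height must be > 0).
    out = out[out["height (mm)"] > 0]

    return out.reset_index(drop=True)


# Made-up rows: one null carat, one negative cost, one extreme cost, one zero height.
sample = pd.DataFrame({
    "carat": [0.30, None, 0.50, 0.40, 0.60, 0.50, 0.45, 0.55],
    "length (mm)": [4.6, 4.4, 5.1, 4.8, 5.4, 5.1, 4.9, 5.2],
    "width (mm)": [4.65, 4.45, 5.15, 4.85, 5.45, 5.15, 4.95, 5.25],
    "height (mm)": [2.7, 2.8, 3.1, 2.9, 3.3, 3.1, 0.0, 3.2],
    "cost(dollars)": [500, 600, -700, 650, 800, 90000, 720, 680],
})

cleaned = clean_diamonds(sample)
print(cleaned)
```

The order of the steps matters: flipping the sign of negative costs before the outlier filter means those rows are judged on their corrected value rather than being discarded as extreme.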

4. Exploratory Analysis

After cleaning, the dataset contains 405085 observations. Below are our seven top analyses, which answer questions such as:

  1. How much do diamonds cost on average? What are the variance and distribution of prices?

  2. How many diamonds of each type of cut and color are there?

  3. How many diamonds are there for each combination of cut and color?

  4. What are the summary statistics of diamonds for each type of cut? What has the market makeup by cut been over these past 10 years?

  • How does the diamond cost vary with carat, year, color, and other properties?
  • Correlations between the variables.
  • Identify trends.
  • Clustering: use an off-the-shelf algorithm to see if the diamonds in the dataset can be naturally grouped into clusters.
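
Several of the questions above map onto short pandas calls. The following is a minimal sketch on made-up rows, assuming column names in the style used in the cleaning section; it illustrates the kind of summary each question needs and is not the analysis code itself.

```python
import pandas as pd

# Made-up illustration data (not taken from the real dataset).
sample = pd.DataFrame({
    "cut": [5, 4, 5, 3, 4, 5],
    "color": ["E", "E", "G", "G", "E", "G"],
    "carat": [0.3, 0.5, 0.7, 0.4, 0.9, 1.1],
    "cost(dollars)": [500, 900, 2100, 600, 3200, 5200],
})

# Average, spread and quartiles of the price.
price_summary = sample["cost(dollars)"].describe()

# Number of diamonds per cut, and per cut/color combination.
cut_counts = sample["cut"].value_counts()
cut_by_color = pd.crosstab(sample["cut"], sample["color"])

# Summary statistics of the price for each cut.
by_cut = sample.groupby("cut")["cost(dollars)"].agg(["count", "mean", "median"])

# Pairwise correlation between numeric variables.
corr = sample[["carat", "cost(dollars)"]].corr()

print(price_summary, cut_counts, cut_by_color, by_cut, corr, sep="\n\n")
```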

Fun fact: Based on research and the cost(dollars) attribute, we can clearly detect possible customer groups within the dataset that may be worth highlighting.
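
One off-the-shelf way to look for such groups is k-means clustering. The sketch below assumes scikit-learn's `KMeans`, two clusters, and two made-up features (carat and cost); the source does not say which algorithm or how many clusters were used, so all of these choices are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up rows of [carat, cost]: three small cheap stones, three large expensive ones.
features = np.array([
    [0.30, 500],
    [0.40, 650],
    [0.35, 600],
    [1.50, 9000],
    [1.60, 9500],
    [1.40, 8800],
])

# Scale first so that cost (thousands) does not dominate carat (fractions).
scaled = StandardScaler().fit_transform(features)

# n_clusters=2 is an assumption; in practice it would be chosen by inspection.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(labels)
```

The cluster numbers themselves are arbitrary; only which rows share a label is meaningful.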

At first glance, the key features to look at are height, width, length, and carat, due to their high correlation with cost. However, based on external research on the diamond industry, we will also include color, clarity, and cut.

Price Prediction

We ran linear regression, decision tree, and random forest models:

So far, the random forest and decision tree models have the highest R-squared.
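
A comparison of this kind might look roughly like the following with scikit-learn. This is a sketch under stated assumptions: the data is synthetic, the two features and all hyperparameters are illustrative, and the scores it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the cleaned data (assumption, not the real dataset).
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.0, size=400)
cut = rng.integers(1, 6, size=400)
cost = 3000 * carat**2 + 200 * cut + rng.normal(0, 100, size=400)

X = np.column_stack([carat, cut])
X_train, X_test, y_train, y_test = train_test_split(
    X, cost, test_size=0.25, random_state=0
)

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
r2_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    r2_scores[name] = r2_score(y_test, model.predict(X_test))

for name, score in sorted(r2_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: R^2 = {score:.3f}")
```

Scoring on a held-out split rather than on the training data matters here, because an unconstrained decision tree can fit its training data almost perfectly.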

Next Step: Fitting the model to understand accuracy.

The linear regression model will be dropped due to the negative predicted price values.
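
To illustrate how a linear model can end up predicting negative prices, here is a toy sketch. It assumes, purely for illustration, that price grows faster than linearly with carat; the numbers are made up and are not taken from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data in which cost grows with the square of carat (assumption).
carat = np.array([0.2, 0.3, 0.5, 1.0, 1.5, 2.0]).reshape(-1, 1)
cost = 3000 * carat.ravel() ** 2

model = LinearRegression().fit(carat, cost)
predicted = model.predict(carat)

# A straight line fitted to a convex curve undershoots at the low end,
# so the smallest stones can receive a negative predicted price.
n_negative = int((predicted < 0).sum())
print(f"{n_negative} of {len(predicted)} predicted prices are negative")
```

Tree-based models avoid this particular problem because their predictions are averages of observed training prices.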

Our decision tree regressor model was able to predict the price of every diamond in the test set with an error of ± 7.49% of the real price.
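
The text does not state how the percentage error was computed. One common choice is the mean absolute percentage error (MAPE), sketched below on made-up prices; treating the reported figure as a MAPE is an assumption.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual|, expressed in percent."""
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100 * sum(ratios) / len(ratios)


# Made-up prices: the individual errors are 10%, 5% and 0%.
actual = [1000.0, 2000.0, 4000.0]
predicted = [1100.0, 1900.0, 4000.0]

mape = mean_absolute_percentage_error(actual, predicted)
print(f"MAPE: {mape:.2f}%")
```

Note that a mean error of this kind describes the typical diamond; individual predictions can be off by more.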


Deploying the Model

The Random Forest model was chosen due to its small MSE and high R-squared.
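
For reference, the two selection metrics can be computed as follows. This is a plain-Python sketch of the standard definitions on made-up numbers; the helper names are hypothetical, and in practice a library implementation would normally be used.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared residuals."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 minus the ratio of residual to total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]

mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(f"MSE = {mse:.2f}, R-squared = {r2:.2f}")
```

A lower MSE and an R-squared closer to 1 both indicate predictions that track the actual prices more closely.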

The basic requirements to deploy the model are now complete!